In this last of SUSA’s crash courses on introductory R programming, we will round out the data manipulation skills learned in r2 (on data cleaning, which, if you recall, captured \(\sim 80\%\) of data science) with the fun part - data analysis! In this tutorial, you will learn how to use ggplot2, R’s premier data visualization library, to conduct EDA (exploratory data analysis) and display data in visually powerful ways. You will also learn our first machine learning algorithm, linear regression, a classical method in statistical analysis. You will learn how to verify the assumptions of the linear regression model, how to interpret its results, and even how to use and tune more complex regressions like ridge regression and polynomial regression.
The immediate prerequisite for this tutorial is r2, which covers several tidyverse packages used for data cleaning. You will also need to install both R and RStudio to use the associated workbook, r3-workbook. Visit r0 for general information on the philosophy and functionality of R and RStudio, as well as installation guides for both.
This document contains textbook-style information on R programming. It will cover the essentials of data visualization with ggplot2 and an introduction to statistical analysis and inference with an overview of linear regression in R.
Throughout this tutorial, you will be working with three distinct datasets, to give you familiarity and practice with the tidyverse. These datasets are as follows:
- iris, Edgar Anderson’s classic multivariate dataset on 150 flowers from a Canadian field. This dataset will be used to illustrate some of the ggplot functions.
- diamonds, a dataset containing the prices and various attributes of almost 54,000 diamonds. This dataset will be used to illustrate some of the ggplot functions.
- mpg, a subset of the EPA’s dataset on fuel economy, including the hwy fuel efficiency and other attributes of 234 automobiles from 1999 to 2008. This dataset will be used in the two mini-projects in r3-workbook.

The r3-workbook contains associated exercises to work through as you learn about the concepts within this document. They are aimed at helping you gain practice and familiarity with R programming concepts and functions. At the end of each section of this document, solve the problems in the matching section of the workbook to cement your understanding of the material.
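Before plotting anything, it can help to peek at each dataset. A quick sketch (diamonds and mpg ship with ggplot2; iris is built into base R):

```r
library(ggplot2)  # provides the diamonds and mpg datasets

head(iris)     # first six flowers: sepal/petal measurements plus Species
dim(diamonds)  # number of rows and columns in the diamonds data
str(mpg)       # column names, types, and a preview of the car data
```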
What is the purpose of data science anyway? I don’t mean to get all philosophical on you, but it is important to have an understanding of the objectives of the data science paradigm as you learn data science techniques.
While you’d get different answers from different data scientists, here are the three main objectives that I have in the data science projects I embark upon:
The difference between your Statistics classes and data science in practice is that your Statistics classes are primarily focused on Objective II and Objective III. For example, STAT 135 & STAT 151A concern themselves primarily with statistical inference in the form of confidence intervals, hypothesis testing, and regression coefficient interpretation. STAT 154 and machine learning in general concern themselves with classification and prediction of new data points given prior data.
In contrast, this workshop is concerned primarily with Objective I. Humans are visual creatures: while a computer is most able to interpret bytes of numbers and text, humans would rather see data visually! A data scientist or statistician requires the trust of their coworkers, their managers, and the public at large. Tables can be an information overload, and model parametrizations may be foreign to your audience. Graphs are one way we connect our data to the people who care about them. The other is statistical reporting, which can prove difficult without practice simplifying results without losing precision or accuracy.
Every data scientist relies on the audience’s understanding of their work, whether their project manager or a client, and so effective data visualization and reporting skills are essential when presenting your findings. Even for a data scientist working on private research, data visualization is a powerful tool to view various aspects and relationships within the data, allowing for a deeper understanding of the underlying structure of data.
Some examples of structures that are easily detectable by humans in graphical displays of data are:
- Outliers
- Clusters
- Relationships between variables
- Timed processes via animated plots
- Spatial distributions via geolocational visualizations
As you will see in the second part of this tutorial, models like linear regression come with a host of assumptions - some of which are most rapidly verified graphically. While the linear regression model is powerful, it doesn’t work in all cases. Blindly applying models to data without exploring the data visually first (EDA) can lead to flawed model design and misinterpretation of results. Once you have a suspicion about the structure of the underlying distribution and patterns in the data, you can then use analysis to either predict new data with known certainty, or inferential statistics to assign quantitative interpretations to these structures. Plus, plotting data is fun! You can quickly pluck out insights in the data in ways that aren’t as easily described quantitatively.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey
One last thought about data visualization - it’s a craft, part technique and part creativity. Feel free to try various ways of displaying the same patterns in your data, and while there are some broad guidelines to effective visualization, a lot of it comes with instinctual creativity and practice.
ggplot2

ggplot2 is R’s premier library for data visualization. Although its syntax is quite different from the rest of base R and most packages, there’s a method to its madness - ggplot2 is backed by a grammar of graphics that allows it to make any plot imaginable (within the constraints of its geoms). However, just because any plot is possible doesn’t mean every plot is good… Let’s start by reviewing some essential tips for effective visual display of data.
First and foremost: tell a story! Data may seem very cold and neutral, but data are quantifications of behaviors in our world. Graphs are how we can make data less cold, and more interpretable for humans. Each graph should show something insightful, important, and easily detectable by the audience. Of course, there are a variety of ways to display the same patterns - the creative process of effective data visualization is choosing how to tell the story best.
Some other tips for data visualization include:
- Maximize the data-ink ratio, the amount of information you are showing per unit of “ink”. You want to include only visual information relevant to your point for that specific graph. If you can’t get rid of irrelevant details without the graph looking sparse, consider highlighting the patterns, trends, or outliers you want the audience to concentrate on with color, sizing, or labels.
- Include a (zero) baseline. Scales give the relative positions of various quantities, but if the audience has no idea what general context the numbers exist in, there’s no point in comparing them. Oftentimes, this means including y = 0 or x = 0 in your axes, rather than truncating the axis, which can be seen as intentionally misleading.
- Communicative elements are a must. Whether this means simply axis and plot titles, rearranging your legend, or adjusting font sizes to show up better on a projector screen, it must be immediately obvious to a new viewer what your graph aims to display.
- Keep ingrained preconceptions of colors in mind. For example, it is generally inadvisable to use bright red to indicate positive trends. Additionally, be wary of color-blindness, which affects over seven percent of the American populace. Finally, use differentiating colors to highlight factors, sequential colors to highlight trends, and diverging colors to highlight splits in behavior. The RColorBrewer package provides color palettes specially designed for data visualization, which ggplot2 can use via its scale_*_brewer functions.
- Order bars by height. Humans prefer things ordered, and it makes relative comparisons between bars much more visible.
- Never use pie charts. This is decreed by the legendary Edward Tufte, one of the founders of data visualization, who noted their imprecision for more than a couple of categories or dimensions. Usually, bar charts are just better. Also, stay away from double-axis graphs - unless you want to mislead someone…
“The only worse design than a pie chart is several of them,” — Edward Tufte
The gg in ggplot2 stands for “Grammar of Graphics”. ggplot2 is the implementation of this idea that all data visualizations decompose into specific components, and so if we have a way to manipulate each component and then sum them together, we can make any plot!
The components used in the layered grammar of graphics underlying ggplot2 are as follows:
1) a base dataset, and a set of mappings from variables to aesthetics
2) one or more layers, with each layer of glyphs having:
- one geometric object, or geom
- (optionally) one statistical transformation
- (optionally) one position adjustment
- (optionally) an alternative dataset and set of aesthetic mappings
3) a scale for each aesthetic mapping
4) a coordinate system
5) (optionally) facet specification
6) communicative elements (labels, borders, titles, etc.)
The basic idea is that we first construct an empty ggplot by specifying the dataset and mappings, then add layers of geometric objects, then alternative scales, coordinates, or faceting if we wish to deviate from the default, and finally communicative elements for human audiences, such as labels, subtitles, and grid lines.
To make this abstraction a little more relatable, imagine you are tasked to draw a scatterplot by hand for a Statistics class. First, you would \((1)\) decide on what data you would be plotting, and then draw the relevant axes. Then, you would \((2)\) plot a layer of points onto your page, \((3)\) choosing your position by whatever scale you set for each axis. If you were going to either use \((4)\) another coordinate system or \((5)\) some sort of faceting, you would also need to draw the graph accordingly. The final step would be to \((6)\) add axis labels, a title, and other communicative elements.
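The same six steps map directly onto ggplot2 syntax. Here is a sketch with every component of the grammar spelled out, even where the default would do (the scale, coordinate, and facet calls below are all optional):

```r
library(ggplot2)

ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +  # (1) dataset + aesthetic mappings
  geom_point() +                                         # (2) one layer, with point as its geom
  scale_x_continuous() +                                 # (3) a scale per aesthetic mapping
  scale_y_continuous() +                                 #     (these are the defaults anyway)
  coord_cartesian() +                                    # (4) the default coordinate system
  facet_wrap(~ Species) +                                # (5) an optional facet specification
  labs(title = "Sepal vs. Petal Length")                 # (6) communicative elements
```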
As you can imagine, you could make up a similar recipe for any plot possible. Just like this analogy and the abstraction it illustrates, the syntactic structure of ggplot2 is meant to mimic the actual modular creation of plots. We’re going to drop this diagram here (notice the similarity to the grammar above!) for now, but we will be going into it in much more detail in the following section, on ggplot2 basics!
ggplot2 Basics (ggplot, aes)

Creating a ggplot starts with constructing a ggplot object. If you check ?ggplot, you will see that ggplot has two arguments:
- data, the dataframe you wish to display
- mapping, a function call to aes, to make a mapping between aesthetics in the plot and variables in data
The aes (aesthetic mapping) function is used to tie visual elements of the plot to variables in the dataframe. The function header for aes is as follows:
aes(x, y, ...)
By default, the first argument (x) you supply to aes indicates which variable in your dataframe corresponds to the x-axis, and the second argument (y) indicates which variable corresponds to the y-axis.
However, you can also make mappings with, for example:
- col/color/colour, the colors of the geoms you are plotting
- fill, the filled color of the geoms you are plotting
- shape, used to map the shape of the points you are plotting
- linetype, used to map the linetype (e.g. dashed, dotted, etc.) of the lines you are plotting
- alpha, used to vary the transparency of the geoms by some variable
Although the “default” mappings can be set in the ggplot constructor, you can also set different data and mapping arguments in your specific geoms too.
An important warning here: aes is used to map aesthetics to variables, NOT to declare constants. If you wished to set your geom to be red, for example, you cannot use aes(col = "red"). You must instead declare col = "red" outside of the aes function call.
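To make the distinction concrete, here’s a sketch of both versions; only the second actually turns the points red:

```r
library(ggplot2)

# Inside aes(): "red" is treated as a variable with one value, so every point
# gets the same default mapped color, plus a spurious legend labeled "red"
ggplot(iris, aes(Petal.Length, Sepal.Length)) +
  geom_point(aes(col = "red"))

# Outside aes(): a constant aesthetic, applied to every point as intended
ggplot(iris, aes(Petal.Length, Sepal.Length)) +
  geom_point(col = "red")
```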
Let’s make a ggplot object for the iris dataset, mapping Petal.Length to the x-axis and Sepal.Length to the y-axis:
ggplot(iris, aes(Petal.Length, Sepal.Length))
As you can see from the figure above, the output of the ggplot constructor function is a ggplot object. The x-axis is already labeled to map to Petal.Length, as is the y-axis for Sepal.Length - just as we had told the aes function.
However, the plot is still empty. Why? Because we haven’t supplied any more components to add to it! Specifically, we haven’t added any layers. In ggplot2 syntax, you can add layers and other components to a ggplot by literally using the addition operator, +. Since a ggplot object + any layer is still a ggplot object, we can use this modular approach to add as many layers and components as we want to a ggplot object.
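Because + always returns a new ggplot object, you can also store a partial plot in a variable and extend it later - a handy pattern when building several variants of the same base:

```r
library(ggplot2)

base <- ggplot(iris, aes(Petal.Length, Sepal.Length))  # empty plot: data + mappings only

base + geom_point()                  # a scatterplot
base + geom_point() + geom_smooth()  # the same scatterplot with a trend layer added
```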
Geoms (geom_*)

Layers of geoms can be added to a ggplot object with one of the geom_* functions. As always, you can quickly glance over the documentation of each geom_* function we go over by checking e.g. ?geom_point. There are dozens of geom_* functions in ggplot2, so this text will focus on a few of them to illustrate the general ggplot2 syntax. You can easily find a visual listing of the geoms in ggplot2 in the official ggplot2 Cheatsheet.
Scatterplots (geom_point, geom_jitter)

First, let’s add geom_point as a layer to turn our empty ggplot into an actual scatterplot. As a reminder, a scatterplot uses a layer of points to display the relationship between the x and the y variables in your data. In this case, we set our x to map to Petal.Length and y to map to Sepal.Length. So, by adding a single layer with point as the geom, we can visualize the relationship between Petal.Length and Sepal.Length in the iris dataset.
ggplot(iris, aes(Petal.Length, Sepal.Length)) + geom_point()
Neat! We can already see a positive, linear relationship between Petal.Length and Sepal.Length - a relationship that would have been difficult to notice by just reading through all 150 entries in the iris dataframe.
Sometimes you will encounter datasets that have a small number of discrete values for either the x variable or the y variable. These datasets initially seem like terrible candidates for geom_point, because the datapoints overlap too much to discern any patterns. For example, suppose we wanted to display the relationship between the cut of a diamond and its clarity, using the data in the diamonds dataset.
diamonds %>% ggplot(aes(cut, clarity)) + geom_point()
In our initial display, we cannot see any pattern whatsoever: there are multiple points for each combination of cut and clarity, but we have no information about how many observations each point represents. To solve this, and also to illustrate the modifiable nature of geoms in ggplot2, we can adjust the position optional argument of geom_point to jitter each point - that is, add a small amount of randomness - to spread them out a little.
ggplot(diamonds, aes(cut, clarity)) + geom_point(position = "jitter")
Already, we can see the underlying pattern of the data more clearly. A diamond with high clarity, such as IF or VVS1, is more likely to be of a high cut than a low one. Since the “jitter” option for geom_point is so commonplace, ggplot2 has a shortcut function for geom_point(position = "jitter"), geom_jitter.
To showcase the other options of geom_point, let’s also set the alpha to be a constant 10%, to allow us to see through dense clusters of points, and the size of the points to be 0.5.
ggplot(diamonds, aes(cut, clarity)) + geom_jitter(alpha = .1, size = .5)
Finally, note that since we are only plotting a single geom, we could (just for fun) move all of the “global” arguments for ggplot to be “specific” arguments for just the geom_jitter layer, and the graph would look identical:
ggplot() + geom_jitter(data = diamonds, aes(cut, clarity), alpha = .1, size = .5)
(NOTICE: the constant alpha and size settings are done outside of aes)
Now, we can see even more facts about our data. For example, it is now visually obvious that most of the diamonds in the diamonds dataset are of VS2 clarity and Ideal cut.
Distributions (geom_histogram, geom_area(stat = "bin"), geom_density)

To display continuous univariate data (data of just one variable), ggplot2 has a variety of geoms. One common way to display the distribution of a continuous variable is via a histogram. A histogram consists of bins on the x-axis, and the counts of each bin on the y-axis.
To make a histogram in ggplot2, you just supply one variable to aes and use the geom_histogram geom. For example, suppose we wanted to visualize the distribution of the petal lengths of the 150 irises. Just to fit the theme, let’s also make the bars orchid-coloured.
iris %>% ggplot(aes(Petal.Length)) + geom_histogram(fill = "orchid")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
(NOTICE: We used fill instead of color here to modify the color of the filled bars instead of their edges)
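The stat_bin() message above is ggplot2 nudging us to choose a bin size deliberately instead of accepting the default of 30 bins. Passing binwidth (0.25 cm here, an arbitrary choice for illustration) silences it:

```r
library(ggplot2)

ggplot(iris, aes(Petal.Length)) +
  geom_histogram(fill = "orchid", binwidth = 0.25)  # bins are 0.25 cm wide
```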
There seem to be a lot of flowers with noticeably shorter petals than the rest. However, the low resolution of the bars makes it difficult to see the exact shape of the distribution. A smoother way to display the counts of a single continuous variable is with a binned area plot, called binned because the y-axis again corresponds to the counts of each bin.
The geom_area(stat = "bin") function specification is used to create these “smoothened” histograms. Let’s recreate a smoother version of our petal length plot:
iris %>% ggplot(aes(Petal.Length)) + geom_area(fill = "orchid", stat = "bin")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Hmm… that’s still not very smooth. To instead see a Gaussian-smoothed density plot, which approximates a more curvy shape for the distribution of some variable, use geom_density. Let’s make one last rendering of the petal lengths in the iris dataset:
iris %>% ggplot(aes(Petal.Length)) + geom_density(fill = "orchid")
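geom_density has its own smoothness knob, adjust, which scales the kernel bandwidth: values below 1 hug the data more tightly, while values above 1 smooth more aggressively. A quick sketch:

```r
library(ggplot2)

ggplot(iris, aes(Petal.Length)) +
  geom_density(fill = "orchid", adjust = 0.5)  # bumpier: half the default bandwidth

ggplot(iris, aes(Petal.Length)) +
  geom_density(fill = "orchid", adjust = 2)    # smoother: twice the default bandwidth
```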
Bar Charts (geom_bar, geom_col)

Now, suppose we want to graph the counts, not per bin of a continuous variable, but rather per value of a discrete variable. For example, let’s find out how many cars are from each manufacturer in the mpg dataset:
mpg %>% ggplot(aes(manufacturer)) + geom_bar()
Sometimes we want to plot some variable other than the counts per value of a discrete variable. To do so, use geom_col instead of geom_bar. For example, suppose we wanted to plot the average fuel economy for each car manufacturer. Let’s also make each bar a different color, just to make things more interesting.
mpg %>% group_by(manufacturer) %>%
summarise(`Average HWY Fuel Economy` = mean(hwy)) %>%
ggplot(aes(manufacturer, `Average HWY Fuel Economy`, fill = manufacturer)) +
geom_col()
(Check for understanding: why was fill used in the aes call rather than outside the aes call this time?)
Apparently Honda and VW build the most fuel-efficient cars. However, humans find it easier to read bars that are ordered by height. Sometimes, you might want to try doing this with your graphs, using dplyr’s mutate function:
mpg %>% group_by(manufacturer) %>%
summarise(`Average HWY Fuel Economy` = mean(hwy)) %>%
mutate(manufacturer = factor(manufacturer,
levels = manufacturer[order(`Average HWY Fuel Economy`)],
ordered = T)) %>%
ggplot(aes(manufacturer, `Average HWY Fuel Economy`, fill = manufacturer)) +
geom_col()
That looks a little clearer! Now that we’ve covered several ways to display data directly, let’s discuss a couple of geoms designed to highlight trends in data.
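If the factor bookkeeping in the reordering snippet above feels fiddly, the forcats package (also part of the tidyverse, assuming you have it installed) offers fct_reorder, which reorders a factor’s levels by another variable in one step; the shorter avg_hwy column name here is just for readability:

```r
library(ggplot2)
library(dplyr)
library(forcats)

mpg %>%
  group_by(manufacturer) %>%
  summarise(avg_hwy = mean(hwy)) %>%
  mutate(manufacturer = fct_reorder(manufacturer, avg_hwy)) %>%  # order levels by avg_hwy
  ggplot(aes(manufacturer, avg_hwy, fill = manufacturer)) +
  geom_col()
```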
Line Graphs and Trend Lines (geom_line, geom_smooth)

Sometimes, you’re given a time series that is best presented as a line graph. For example, suppose we had a dataframe called stock_price that tracked the valuation of some statistical consulting company. We could use geom_line to create a line graph of the stock price over time:
stock_price %>% ggplot(aes(Date, `Stock Price`)) + geom_line()
To make the line thicker, we can use the size option for geom_line. Remember, you can always check e.g. ?geom_line to see what optional arguments you can specify about your geoms.
stock_price %>% ggplot(aes(Date, `Stock Price`)) +
geom_line(size = 2)
To showcase the modularity of ggplot2’s components, let’s also add a layer of points to the line plot. Additionally, I believe it can be misleading not to include 0 in your plots. To expand the axes of your plots, use expand_limits, as seen here:
stock_price %>% ggplot(aes(Date, `Stock Price`)) +
geom_line(size = 2) +
geom_point(size = 2) +
expand_limits(y = 0)
Of course, if we really wanted to, we could just go overboard and add even more geoms…
stock_price %>% ggplot(aes(Date, `Stock Price`)) +
geom_col(alpha = .3) +
geom_line(size = 2) +
geom_point(size = 2)
But that would clearly violate our dedication to a high data-ink ratio. It is better to use data visualization to highlight, rather than obfuscate, patterns and trends. Suppose we wanted to go back to our petal and sepal length scatterplot and more clearly highlight the relationship between the two variables. We could use geom_smooth, a geom that will add a smooth trend line to our visualization.
iris %>% ggplot(aes(Petal.Length, Sepal.Length)) +
geom_smooth(size = 2, alpha = .9) +
geom_point(alpha = .7, size = 4)
## `geom_smooth()` using method = 'loess'
To turn off the confidence band around the smooth line, you can specify se = FALSE. Additionally, if you’d rather fit a linear model than the default (usually loess, a form of local regression), you can manually specify method = "lm". Finally, let’s add a bit more information to this plot by coloring our points by Species. Our final plot is constructed like so:
iris %>% ggplot(aes(Petal.Length, Sepal.Length)) +
geom_smooth(method = "lm", se = F, size = 2, alpha = .9, col = "black") +
geom_point(alpha = .7, size = 4, aes(col = Species))
This plot is relatively simple, but reveals several facts all at once:
1. Virginica irises are usually bigger than Versicolor irises, which are usually bigger than Setosa irises
2. There is a clear linear relationship between petal length and sepal length in this sample of irises
3. Virginica and Versicolor irises are noticeably closer in size to each other than to the much smaller Setosa irises
4. Setosa irises tend to vary less in size than the other two species
Faceting (facet_wrap, facet_grid)

It’s useful to view all three species together to compare and contrast them, but if we wanted to fit a line through each species, by specifying color = Species as a grouping aesthetic for both geom_point and geom_smooth, our plot would look a bit messier:
iris %>% ggplot(aes(Petal.Length, Sepal.Length, col = Species)) +
geom_smooth(method = "lm", se = F, size = 2, alpha = .9, aes(col = Species)) +
geom_point(alpha = .7, size = 4, aes(col = Species))
(NOTE: alternatively, you can specify aes(col = Species) in just the ggplot constructor)
Sometimes, it’s easier to read a lot of information if the graph is split into panels, one for each discrete case. The process of splitting your data into discrete panels is called faceting. In this case, let’s facet by the Species of the irises. The syntax for facet_wrap, which is used to facet by a single variable, is facet_wrap(~ VARIABLE). Observe the effect of faceting on this altered version of the plot above:
iris %>% ggplot(aes(Petal.Length, Sepal.Length, col = Species)) +
geom_smooth(method = "lm", se = F, size = 2, alpha = .9) +
geom_point(alpha = .7, size = 2) +
facet_wrap(~ Species)
You can also specify the number of rows/columns, and whether or not every facet should have the same scale or not. Check ?facet_wrap for more info. Here’s another version of the plot above:
iris %>% ggplot(aes(Petal.Length, Sepal.Length, col = Species)) +
geom_smooth(method = "lm", se = F, size = 2, alpha = .9, aes(col = Species)) +
geom_point(alpha = .7, size = 2, aes(col = Species)) +
facet_wrap(~ Species, ncol = 1, scales = "free_y")
In some other cases, we want to make a grid of panels, one for every combination of two discrete variables. The syntax for facet_grid, the two-dimensional version of facet_wrap, is facet_grid(Y_VARIABLE ~ X_VARIABLE). Suppose we wanted to investigate how carat affects price, for each particular color and cut combination. One way to display this would be to have carat as the x-axis, price as the y-axis, and facet by color and cut, like so:
diamonds %>% ggplot(aes(carat, price, col = color)) +
geom_jitter(size = .5) +
geom_smooth(method = "lm", se = F, alpha = .5, col = "black") +
facet_grid(color ~ cut)
Perhaps interestingly, it appears that the relationship between the carat of a diamond and its price is similar regardless of the diamond’s color or cut. We will investigate the effect of the color and the cut of a diamond on the relationship between its carat and price near the end of this text.
Communicative Elements (labs, theme, scale_*_*)

As described at the beginning of this workshop, communicative elements are essential for humans to be able to read and interpret your graph quickly. Words can direct the audience to notice such things as the purpose of a graph, the units of the variables, and more. Especially in presentations, labels are important because you rely on your audience understanding the graph within a few seconds of its display. To add labels to a plot, simply use the labs function:
iris %>% ggplot(aes(Petal.Length, Sepal.Length)) +
geom_smooth(method = "lm", size = 2, alpha = .9, col = "black", se = F) +
geom_point(size = 4, aes(col = Species)) +
labs(x = "Petal Length (cm)", y = "Sepal Length (cm)", title = "Relationship between Petal Length and Sepal Length",
subtitle = "You can make subtitles too!", caption = "And captions will appear here.")
If you want to modify parts of your plot - like how the title and labels appear, where the legend goes, and more - you can use the theme function. This function has a literal tonne of parameters, so check ?theme, ?element_blank, ?element_line, ?element_rect, and ?element_text to view them all! I won’t go into detailed use of theme, but here’s an example of some of the more common options:
iris %>% ggplot(aes(Petal.Length, Sepal.Length)) +
geom_smooth(method = "lm", size = 2, alpha = .9, col = "black", se = F) +
geom_point(size = 4, aes(col = Species)) +
labs(x = "Petal Length (cm)", y = "Sepal Length (cm)", title = "Relationship between Petal Length and Sepal Length",
subtitle = "You can make subtitles too!", caption = "And captions will appear here.") +
theme(plot.title = element_text(face = "bold", size = 16, hjust = .5),
legend.position = "bottom", panel.grid.minor = element_blank())
Finally, sometimes you may feel like choosing a different color scheme, for a variety of reasons, whether subjective or utilitarian. In any case, use the scale_*_* functions (e.g. scale_fill_brewer, scale_color_gradient2, etc.) to easily select alternative color schemes. While I won’t go into detail here, you can read up on how to edit the scales of both your colors and your axes. There are also additional packages designed to make choosing themes in ggplot2 easier and more fun, like RColorBrewer, ggthemes, and my personal favorite collection of ggplot2 themes, ggthemr.
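For instance, to swap the default colors of our iris scatterplot for one of RColorBrewer’s qualitative palettes (here "Dark2", one of its colorblind-friendly options), a quick sketch:

```r
library(ggplot2)

ggplot(iris, aes(Petal.Length, Sepal.Length, col = Species)) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Dark2")  # a qualitative RColorBrewer palette
```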
broom

The tidy function from the broom package converts model objects, like the output of lm, into regular dataframes, which lets us fit and compare many models with the same dplyr verbs we use on data. For example, we can fit a separate linear model of price on carat within every cut and color combination of diamonds, then display the slope estimates as a heatmap:

diamonds %>%
  group_by(cut, color) %>%
  do(lm(., formula = price ~ carat) %>% tidy) %>%
  filter(term == "carat") %>%
  ggplot(aes(cut, color, fill = estimate)) +
  geom_tile() +
  scale_fill_gradient2()
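To see what tidy itself returns, here’s a minimal sketch (assuming the broom package is installed) fitting a single linear model of price on carat:

```r
library(ggplot2)  # for the diamonds dataset
library(broom)

fit <- lm(price ~ carat, data = diamonds)
tidy(fit)  # a dataframe with one row per coefficient:
           # term, estimate, std.error, statistic, p.value
```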
There are two mini-projects for this tutorial, designed to give you practice with EDA and more advanced model selection. Find the full project specifications in the mini-project section of the r3-workbook.
Exploratory Data Analysis (EDA) is the process of using summary data and data visualization to get a sense of the underlying structures within your data before analysis. It’s a fairly creative process, and like most creative processes, the best way to learn how to do EDA effectively is with practice. In this mini-project, you will explore either the diamonds or the mpg dataset, noting any unusual relationships or outliers.
Now that you’re familiar with graphing with ggplot2 and model selection with broom, try your hand at this fairly advanced graphing exercise with the mpg dataset. You will learn a new type of regression, elastic net regression, which has not one but two hyperparameters! You will use function definitions to grid search hyperparameters, construct a dataframe of various models for predicting the hwy of each car with broom, and finally graph your validation process with ggplot2.
This ends our textbook-style tutorial on data visualization with ggplot2 and an introduction into the first of our machine learning algorithms, linear regression with lm and broom. For more practice, check out the mini-project section of r3-workbook.
advr1

This marks the end of our introductory series into R programming! Congratulations on making it this far. By now, you should be able to begin your own foray into data cleaning, data visualization, and data analysis. Stay tuned in two weeks for advr1, the SUSA crash course introduction to neural nets in R!
RStudio publishes a collection of cheatsheets that make learning the functions of the tidyverse and other packages more visual. The relevant sheets for this tutorial cover:

- dplyr and tidyr functions as applied to manipulating dataframes
- readr and tidyr functions as applied to reading and cleaning data
- the *apply family as well as purrr
- the stringr package for text manipulation

For extensions to ggplot2, check out:

- ggthemr
- plotly::ggplotly
- ggmap
- sf

For more tidyverse packages, visit the official tidyverse package listing.